Abstract

In this paper, we use Recurrent Neural Networks to capture and model
human motion data and to generate motions by predicting the next immediate
data point at each time-step. Our RNN is equipped with the recently proposed
Gated Recurrent Units, which have shown promising results in sequence modeling
problems such as Machine Translation and Speech Synthesis. We demonstrate
that this model is able to capture long-term dependencies in data and generate
realistic motions.


1 Introduction 

Sequence modeling has been a challenging problem in Machine Learning because it requires models that
are able to capture temporal dependencies. One of the early models for sequence modeling was the
Hidden Markov Model (HMM) [1]. HMMs capture the data distribution using multinomial
latent variables. In this model, each data point at time $t$ is conditioned on the hidden state at time $t$,
and the hidden state at time $t$ is conditioned on the hidden state at time $t-1$. In HMMs both $P(x_t \mid s_t)$ and
$P(s_t \mid s_{t-1})$ are the same for all time-steps. A similar idea of parameter sharing is used in the Recurrent
Neural Network (RNN) [2]. RNNs are an extension of feedforward neural networks whose
weights are shared across every time-step in the data. Consequently, we can apply RNNs to sequential
input data.

Theoretically, RNNs are capable of capturing sequences of arbitrary complexity. Unfortunately, as
shown by Bengio et al. [3], there are difficulties in training RNNs on sequences with long-term
dependencies. Among the many solutions proposed for RNN training problems over the past few decades,
we use Gated Recurrent Units, recently proposed by Cho et al. [4]. As shown by Chung
et al. [5], the Gated Recurrent Unit performs much better than conventional tanh units.

In the following sections, we introduce the model, train it on the MIT motion database
[6], and show that it is capable of capturing the complexities of human body motions. We then demonstrate
that we are able to generate sequences of motions by predicting the next immediate data point
given all previous data points.

2 Recurrent Neural Network 

The simple Recurrent Neural Network, which has been shown to be able to implement a Turing Machine
[7], is an extension of feedforward neural networks. The idea in RNNs is that they share parameters
across time-steps. This idea, called parameter sharing, enables RNNs to be used for
sequential data.

RNNs have memory and can memorize input values for some period of time. More formally, if
the input sequence is $x = \{x_1, x_2, \dots, x_N\}$, then each hidden state is a function of the current input and
the previous hidden state.


$$h_t = F_\theta(h_{t-1}, x_t)$$

where $F_\theta$ is a linear transformation followed by a non-linearity,

$$h_t = H(W_{hh} h_{t-1} + W_{xh} x_t)$$

where $H$ is a non-linear function, which in the vanilla RNN is the conventional tanh. By unrolling
the above equation, it can be shown that each hidden state $h_t$ is a function of all previous inputs,

$$h_t = G_\theta(x_1, x_2, \dots, x_t)$$

where $G_\theta$ is a complicated non-linear function that summarizes all previous inputs in $h_t$.
A trained $G_\theta$ puts more emphasis on some aspects of some of the previous inputs in order to minimize
the overall cost of the network.

Finally, in the case of real-valued outputs, the output at time-step $t$ can be computed as follows:

$$y_t = W_{hy} h_t$$

Note that bias vectors are omitted to keep the notation simple. A graphical illustration of RNNs is 
shown in figure 1. 

Figure 1: Left: A Recurrent Neural Network with recurrent connections from the hidden units to themselves.
Right: The same network unfolded in time. Note that the weight matrices are the same for every
time-step.
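
The recurrence above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses plain Python lists instead of an array library, and the layer sizes, the zero initial state, the random weight range and the helper names (`matvec`, `random_matrix`, `rnn_forward`) are all assumptions made for the example.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def random_matrix(rows, cols, rng, scale=0.5):
    """Hypothetical initializer: uniform weights in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Run a vanilla tanh RNN over a sequence (biases omitted, as in the text)."""
    h = [0.0] * len(W_hh)  # assumed initial hidden state h_0 = 0
    hs, ys = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_xh, x))]
        h = [math.tanh(v) for v in pre]  # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
        hs.append(h)
        ys.append(matvec(W_hy, h))       # y_t = W_hy h_t
    return hs, ys


rng = random.Random(0)
n_in, n_hid, n_out = 2, 3, 1
W_xh = random_matrix(n_hid, n_in, rng)
W_hh = random_matrix(n_hid, n_hid, rng)
W_hy = random_matrix(n_out, n_hid, rng)
xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(5)]
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy)
```

The same three weight matrices are reused at every iteration of the loop, which is the parameter sharing described above, and the last hidden state depends on every earlier input through the chain of updates.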


2.1 Generative Recurrent Neural Network 

We can use a Recurrent Neural Network as a generative model in such a way that the output of the network
at time-step $t-1$ defines a probability distribution over the next input at time-step $t$. According to
the chain rule, we can write the joint probability distribution over the input sequence as follows.

$$P(x_1, x_2, \dots, x_N) = P(x_1)\,P(x_2 \mid x_1) \cdots P(x_N \mid x_1, \dots, x_{N-1})$$

Now we can model each of these conditional probability distributions as a function of hidden states. 


$$P(x_t \mid x_1, \dots, x_{t-1}) = f(h_t)$$

Since $h_t$ is a fixed-length vector and $\{x_1, \dots, x_{t-1}\}$ is a variable-length sequence, $h_t$ can
be considered a lossy compression of the history. During learning, the network should learn to keep
important information (according to the cost function) and throw away useless information. Thus,
in practice the network only looks back some number of time-steps, until $x_{t-k}$. The architecture of a Generative
Recurrent Neural Network is shown in figure 2.
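
Taking the logarithm of the chain-rule factorization above turns the product into a sum over time-steps, which is the form usually optimized in practice. The text does not state the training objective or the output distribution, so the following is a sketch under stated assumptions rather than the authors' formulation; $f(h_t)(x_t)$ denotes the density defined by $f(h_t)$ evaluated at the observed $x_t$.

```latex
\log P(x_1, \dots, x_N)
  = \sum_{t=1}^{N} \log P(x_t \mid x_1, \dots, x_{t-1})
  = \sum_{t=1}^{N} \log f(h_t)(x_t)
```

If one additionally assumes that $f(h_t)$ is a Gaussian with fixed variance centered on the network output, maximizing this sum is equivalent to minimizing the squared error between each prediction and the next data point.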

Unfortunately, as shown by Bengio et al. [3], there are optimization issues when we try to train
such models on data with long-term dependencies. The problem is that when an error occurs, as we
back-propagate it through time to update the parameters, the gradient may decay exponentially to
zero (the Gradient Vanishing problem) or grow exponentially large. For the problem of huge gradients, an ad-hoc
solution is to restrict the gradient so that it does not exceed a threshold. This technique is known as gradient
clipping. The solution for Gradient Vanishing, however, is not trivial. Over the past few decades, several
methods have been proposed [e.g. 8, 9, 10] to tackle this problem. Although the problem still remains,
gating methods have shown promising results in comparison with the vanilla RNN in different tasks
such as Speech Recognition [11], Machine Translation [12], and Image Caption Generation [13].
One of the models that exploits a gating mechanism is the Gated Recurrent Unit [4].

Figure 2: An unfolded Generative Recurrent Neural Network whose output at time-step $t-1$ defines
a conditional probability distribution over the next input. Dashed lines are used during the generation phase.
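
Gradient clipping as described above can be illustrated as follows. The text only says the gradient is kept below a threshold; rescaling by the L2 norm is one common variant and is an assumption here, as are the function name `clip_by_norm` and the example values.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale a gradient vector so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)          # already small enough, leave unchanged
    scale = threshold / norm
    return [g * scale for g in grad]  # same direction, norm equal to threshold


clipped = clip_by_norm([3.0, 4.0], 1.0)    # norm 5 is rescaled to norm 1
unchanged = clip_by_norm([0.3, 0.4], 1.0)  # norm 0.5 is below the threshold
```

Rescaling the whole vector preserves the direction of the update, whereas clipping each component independently would change it.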

2.2 Gated Recurrent Unit 

The Gated Recurrent Unit (GRU) differs from the simple RNN in that each hidden unit
has two gates. These gates, called the update and reset gates, control the flow of information
inside each hidden unit. Each hidden state at time-step $t$ is computed as follows:

$$h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t$$

where $\circ$ is an element-wise product, $z_t$ is the update gate, and $\tilde{h}_t$ is the candidate activation,

$$\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh}(r_t \circ h_{t-1}))$$

where $r_t$ is the reset gate. Both update and reset gates are computed using a sigmoid function $\sigma$:

$$z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1})$$

$$r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1})$$

where the $W$s are the weight matrices of the two gates. A Gated Recurrent Unit is shown in figure 3.

Figure 3: A Gated Recurrent Unit, in which $r$ and $z$ are the reset and update gates, and $h$ and $\tilde{h}$ are the
activation and the candidate activation. [5]
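
A single GRU update following the four equations of this section might look like the sketch below. It is illustrative only: the parameter dictionary `p`, the helper names, the layer sizes and the random initialization are assumptions, biases are omitted as in the text, and plain lists are used where a real implementation would use an array library.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_step(x, h_prev, p):
    """One GRU update: gates z and r, candidate h_tilde, then the new state."""
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_xz"], x), matvec(p["W_hz"], h_prev))]
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["W_xr"], x), matvec(p["W_hr"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r_t o h_{t-1}
    h_tilde = [math.tanh(a + b)
               for a, b in zip(matvec(p["W_xh"], x), matvec(p["W_hh"], gated))]
    # h_t = (1 - z_t) o h_{t-1} + z_t o h_tilde_t
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


rng = random.Random(0)
n_in, n_hid = 2, 3
shapes = {"W_xz": n_in, "W_xr": n_in, "W_xh": n_in,
          "W_hz": n_hid, "W_hr": n_hid, "W_hh": n_hid}
params = {name: [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                 for _ in range(n_hid)]
          for name, cols in shapes.items()}

h = [0.0] * n_hid
for x in ([0.1, -0.3], [0.7, 0.2], [-0.4, 0.5]):
    h = gru_step(x, h, params)
```

Because the new state is a convex combination of the previous state and the candidate, an update gate close to zero lets a unit carry its value across many time-steps almost unchanged.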


3 Experimental results 

In this section we describe the motion dataset and the results for modeling and generating human motions.

3.1 Ability to capture long-term dependency


Before training the model on motion data, let us first compare the GRU with the conventional tanh unit. As
discussed in section 2, due to optimization problems, simple RNNs are not able to capture long-term
dependencies in data (the Gradient Vanishing problem). Thus, instead of the tanh activation
function, we use the GRU. Here we show that the GRU performs much better. The task is
to read a sequence of random numbers, memorize them for some period of time, and then emit
a function that is a sum over input values. We generated 100 different sequences, each containing
20 rows (time-steps) and 2 columns (attributes of each data point). We trained the models such that
the output at time $t$ ($y_t$) is a function of previous input values:


$$y_t = x_{t-3}[0] + x_{t-5}[1]$$


Hence, we expect the models to memorize inputs up to 5 time-steps back and to learn when to use which dimension
of the previous inputs. For both models, the input is a vector of size 2, the output is a scalar value, and the
single hidden layer has 7 units. We allowed both networks to overfit the training data. Figure 4 shows
that the model with GRU performs very well, while the simple tanh model cannot capture the dependency.



Figure 4: The first two graphs show the two input signals of the model for 20 time-steps. The third and fourth
graphs are associated with tanh and GRU respectively. The solid line is the real target and the dashed
line is the model output. The GRU output in the fourth graph follows the target signal
much more closely.
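
The toy data for this experiment could be generated along the following lines. The sizes (100 sequences, 20 time-steps, 2 attributes) and the target rule come from the text; the uniform range of the random inputs, the zero target for the first five time-steps (where the rule is undefined) and the function name `make_toy_sequence` are assumptions made for this sketch.

```python
import random


def make_toy_sequence(rng, length=20, n_features=2):
    """Return one input sequence and its targets y_t = x_{t-3}[0] + x_{t-5}[1]."""
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
          for _ in range(length)]
    ys = []
    for t in range(length):
        if t < 5:
            ys.append(0.0)  # assumption: no valid history yet, target set to 0
        else:
            ys.append(xs[t - 3][0] + xs[t - 5][1])
    return xs, ys


rng = random.Random(0)
dataset = [make_toy_sequence(rng) for _ in range(100)]
xs0, ys0 = dataset[0]
```

Each target mixes the first attribute from three steps back with the second attribute from five steps back, so a model has to retain both columns for different lengths of time.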


3.2 Dataset 

Among the available Motion Capture (MOCAP) datasets, we used simple walking motion from the MIT Motion
dataset [6]. The dataset was generated by filming a man wearing a suit with 17 small lights
that determine the positions of the body joints. Each data point in our dataset consists of information
about global orientation and displacement. To be able to generate more realistic motions, we used
the same preprocessing as Taylor et al. [14]. Our final dataset contains 375 rows, where each
row contains 49 ground-invariant, zero-mean, unit-variance features of the body joints during walking.
We also used Neil Lawrence's motion capture toolbox to visualize the data in 3D space. Samples of
the data are shown in figure 5.

Figure 5: A sequence of frames of walking from the MIT motion dataset. Each frame has 17 light points
on the body. [6]
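
The zero-mean, unit-variance scaling mentioned above can be sketched as below. This covers only the standardization step; the ground-invariant transformation from the preprocessing of Taylor et al. [14] is not reproduced. The function name `standardize`, the guard for constant columns and the tiny example matrix are assumptions, not real motion data.

```python
import math


def standardize(rows):
    """Scale each column to zero mean and unit variance.

    Returns the scaled rows together with the per-column means and standard
    deviations, which are needed to map generated frames back to the
    original units.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        stds.append(math.sqrt(var) if var > 0 else 1.0)  # guard constant columns
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(n_cols)] for r in rows]
    return scaled, means, stds


frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]  # made-up stand-in features
scaled, means, stds = standardize(frames)
```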


3.3 Motion generation 

We trained our GRU Recurrent Neural Network, which has 49 input units and 120 hidden units in a
single hidden layer. We then use it in a generative fashion, in which each output at time $t$ is fed back to the
model as $x_{t+1}$. To initialize the model, we first feed it 50 frames of the training data
and then let it generate a sequence of arbitrary length. The generation quality is good enough
that the result cannot be distinguished from real training data by the naked eye. In figure 6, the average
over all 49 features is plotted for better visualization. The initialization and generation phases are
shown in figure 7.
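
The initialization and generation phases described above amount to the loop sketched here. The callables `step` and `readout` are hypothetical placeholders for a trained recurrent update (such as a GRU) and its output layer; the toy stand-ins at the bottom exist only to make the feedback visible and are not a trained model.

```python
def generate(step, readout, h0, seed_frames, n_generate):
    """Prime a recurrent model on real frames, then feed its outputs back in."""
    if not seed_frames:
        raise ValueError("at least one seed frame is required")
    h = h0
    y = None
    for x in seed_frames:        # initialization phase: inputs are real data
        h = step(x, h)
        y = readout(h)           # prediction of the next frame
    generated = []
    for _ in range(n_generate):  # generation phase: x_{t+1} = y_t
        generated.append(y)
        h = step(y, h)
        y = readout(h)
    return generated


# Toy stand-ins: the state stores the last input, the readout halves it.
def toy_step(x, h):
    return list(x)


def toy_readout(h):
    return [0.5 * v for v in h]


frames = generate(toy_step, toy_readout, [0.0], [[8.0]], 3)
```

In the setting of the text, `seed_frames` would hold the 50 training frames used for initialization, and `n_generate` can be any length because the model only ever consumes its own previous output.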


Figure 6: Average over target (delayed input) and generated motions. The average (for visualization) of all
49 features is shown as a solid line for the real target and as a dashed line for the generated data.


4 Conclusion 

In this paper we have demonstrated that the Gated Recurrent Unit alleviates the optimization problems of Recurrent
Neural Networks when there are long-term dependencies in data. We ran our experiments
discriminatively on a toy dataset and generatively on the MIT motion dataset, and showed
that GRU performs much better than simple Recurrent Neural Networks with the conventional tanh
activation function in both memorizing and generating.

Figure 7: Average over initial and generated motions. The graph shows the initialization using real data for the
first 50 time-steps as a solid blue line; the rest is the generated sequence, shown as a green dashed line.